Cloud Computing I - Week02 - 2 - Membership

What is a Group Membership List?

  • Mean Time to Failure (MTTF)

Target Settings

  • Process 'group' based systems
    • clouds/datacenters
    • replicated servers
    • Distributed DBs
  • crash-stop/fail-stop process failures

Group Membership Service

images02_02_01

  • The membership list is the list of all processes that are currently running.
  • Applications such as gossip, overlays, and DHTs query this list and rely on it being up to date.
  • The membership protocol maintains the membership list.
  • One challenge is that the membership protocol has to communicate over an unreliable medium, which can drop or delay packets.

images02_02_02

  • Strongly consistent membership: virtual synchrony, a well-known distributed computing paradigm, relies on this
  • Partially consistent list
  • Weakly consistent list
  • Failure detectors + dissemination

Failure Detectors

Distributed Failure Detectors: Properties

  • Completeness
    • every failure is eventually detected (there is no bound on how long this takes)
  • Accuracy
    • there are no false positives (no live process is declared failed)
  • Speed
  • Scale
  • Completeness and accuracy cannot both be guaranteed in lossy networks.

Failure Detector Properties

  • Completeness: guaranteed (almost always 100%)
  • Accuracy: partial/probabilistic guarantee (<100%)
  • Speed: time (to detect a failure)
  • Scale:
    • no bottlenecks or single point of failure
    • equal load on each member
    • network message load

Centralised Heartbeating

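  • every process pi sends periodic heartbeat messages to a central process pj
  • each heartbeat carries a sequence number
  • if pj does not receive a heartbeat from pi within a timeout, it marks pi as failed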

    images02_02_04
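
The bookkeeping at the central process can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name `CentralMonitor`, the `timeout` parameter, and the explicit `now` timestamps are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CentralMonitor:
    """Hypothetical central process pj that tracks heartbeats from every pi."""
    timeout: float                                   # assumed detection timeout, in seconds
    last_seq: dict = field(default_factory=dict)     # member -> highest sequence number seen
    last_seen: dict = field(default_factory=dict)    # member -> local time of that heartbeat

    def on_heartbeat(self, member: str, seq: int, now: float) -> None:
        # Ignore stale or duplicated heartbeats: only a larger sequence number counts.
        if seq > self.last_seq.get(member, -1):
            self.last_seq[member] = seq
            self.last_seen[member] = now

    def failed_members(self, now: float) -> list:
        # A member is marked failed once no fresh heartbeat arrived within the timeout.
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)


monitor = CentralMonitor(timeout=3.0)
monitor.on_heartbeat("p1", seq=1, now=0.0)
monitor.on_heartbeat("p2", seq=1, now=0.0)
monitor.on_heartbeat("p1", seq=2, now=2.0)   # p1 keeps sending, p2 goes silent
print(monitor.failed_members(now=4.0))       # only p2 has exceeded the timeout
```

Passing `now` explicitly keeps the sketch deterministic; a real monitor would read a clock instead.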

Ring Heartbeat

  • each process sends heartbeats to both its left and its right neighbor
  • the heartbeat format is the same: a sequence number
  • Failure condition
    • if there are multiple simultaneous failures, some may go undetected

images02_02_05
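
To see why multiple failures can go undetected in a ring, consider the small sketch below. It assumes processes are numbered 0..n-1 around the ring, that all listed failures happen at the same time, and that only a live neighbor can notice missing heartbeats. The function name `undetected_failures` is hypothetical.

```python
def undetected_failures(n: int, failed: set) -> set:
    """Return failed processes in a ring of n that no live neighbor can observe."""
    missed = set()
    for p in failed:
        left, right = (p - 1) % n, (p + 1) % n
        # Only the two ring neighbors expect heartbeats from p.
        if left in failed and right in failed:
            missed.add(p)
    return missed


print(undetected_failures(8, {3}))        # single failure: both neighbors notice
print(undetected_failures(8, {2, 3, 4}))  # process 3 and both of its neighbors crash together
```

In the second call, process 3 is the only one whose failure is missed, because processes 2 and 4 each still have one live neighbor.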

All-to-All Heartbeat

  • each process sends heartbeats to all other processes
  • equal load per member
  • it guarantees completeness
  • problem:
    • suppose one node pj is slow and receives heartbeats late
    • pj may then wrongly mark all the other nodes as failed

images02_02_06

Gossip-Style Membership

  • a variant of all-to-all heartbeating that is more robust
  • it has good accuracy properties

images02_02_07

Gossip Style Failure Detection

  • if a member's heartbeat has not increased for more than Tfail seconds, the member is considered failed
  • after a further Tcleanup seconds, the member is deleted from the list
  • Why do we need two timeouts?
    • one node may already have deleted its entry for the member while another node has not
    • gossip from the second node would then add the deleted entry back
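
The two timeouts can be illustrated with a small membership table. This is a sketch under assumptions, not the protocol's reference implementation: the names `Entry`, `GossipTable`, `merge`, and `sweep` are hypothetical, gossip messages are modeled as plain `{member: heartbeat}` dictionaries, and time is passed in explicitly.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    heartbeat: int       # highest heartbeat counter seen for this member
    updated_at: float    # local time when that counter last increased
    failed: bool = False


class GossipTable:
    def __init__(self, t_fail: float, t_cleanup: float):
        self.t_fail, self.t_cleanup = t_fail, t_cleanup
        self.entries = {}

    def merge(self, remote: dict, now: float) -> None:
        """Merge a gossiped {member: heartbeat} map into the local table."""
        for member, hb in remote.items():
            entry = self.entries.get(member)
            if entry is None:
                self.entries[member] = Entry(hb, now)   # unknown member: add it
            elif hb > entry.heartbeat:
                entry.heartbeat, entry.updated_at, entry.failed = hb, now, False
            # Otherwise the gossip is stale. Because the failed entry is still
            # remembered, the stale message cannot re-add the member.

    def sweep(self, now: float) -> None:
        for member, entry in list(self.entries.items()):
            age = now - entry.updated_at
            if age > self.t_fail + self.t_cleanup:
                del self.entries[member]    # forget only after the cleanup wait
            elif age > self.t_fail:
                entry.failed = True         # considered failed, but still remembered


table = GossipTable(t_fail=5.0, t_cleanup=5.0)
table.merge({"p1": 10, "p2": 7}, now=0.0)
table.merge({"p1": 11}, now=4.0)     # only p1's heartbeat increases
table.sweep(now=6.0)                 # p2 has been stale for longer than Tfail
table.merge({"p2": 7}, now=7.0)      # a slower peer still gossips the old entry
table.merge({"p1": 12}, now=9.0)
table.sweep(now=11.0)                # Tfail + Tcleanup has passed for p2
print(sorted(table.entries))
```

Had p2 been deleted at time 6.0 instead of merely marked, the stale gossip at time 7.0 would have taken the "unknown member" branch and added it back as if it were alive.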

Analysis/Discussion

  • What happens if gossip period Tgossip is decreased?
  • A single heartbeat takes O(log(N)) time to propagate

images02_02_08
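
The logarithmic propagation time can be checked empirically with a small simulation. The sketch assumes a simple push model in which every member that already holds the update sends it to one uniformly random member per gossip round. The function name `rounds_to_spread` and the fixed random seed are choices made for this example only.

```python
import math
import random


def rounds_to_spread(n: int, seed: int = 0) -> int:
    """Count push-gossip rounds until all n members hold one update."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        # Every informed member pushes the update to one uniformly random member.
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds


for n in (16, 256, 4096):
    # Columns: group size, simulated rounds, ceil(log2(n)) for comparison.
    print(n, rounds_to_spread(n), math.ceil(math.log2(n)))
```

The informed set can at most double per round, so the round count can never fall below log2(N). In this model it typically stays within a small constant factor of that bound.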


Which is the best failure detector?

  • Completeness: guaranteed always
  • Accuracy: probability PM(T) of a mistake within time T
  • Speed: T time units
  • Scale:
    • equal load L on each member
    • network message load N*L (compare this across protocols)

All to all heartbeating

images02_02_09 (load for plain all-to-all heartbeating)

images02_02_10 (load for gossip-style heartbeating)

  • Gossip has a higher load than plain all-to-all heartbeating
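
The comparison can be made concrete with a short calculation. The formulas below do not appear in the surviving text of these notes and are an assumption based on the usual analysis. In all-to-all heartbeating each member sends N heartbeats every T time units, so L = N/T. In gossip each member sends an N-entry message every gossip period tg, with T = log(N) * tg, so L = N/tg = N*log(N)/T. The natural logarithm is used here. The function names are hypothetical.

```python
import math


def all_to_all_load(n: int, t: float) -> float:
    """Assumed per-member load: N heartbeats sent every T time units."""
    return n / t


def gossip_load(n: int, t: float) -> float:
    """Assumed per-member load: an N-entry gossip message every T / log(N) time units."""
    t_gossip = t / math.log(n)
    return n / t_gossip


n, t = 1000, 10.0
print(all_to_all_load(n, t), gossip_load(n, t))
```

Under these assumptions the gossip load exceeds the all-to-all load by a factor of log(N) for the same detection time T.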

The best/optimal we can do!

  • worst-case load L* (per member), as a function of T, PM(T), N
  • independent message loss probability pml

images02_02_11 (L* does not depend on N)
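
As a worked example of why the optimal load need not depend on N: the formula below is not in the surviving text and is assumed from the standard argument. If each message is lost independently with probability pml, a member that sends m messages within T has all of them lost with probability pml^m. Requiring pml^m <= PM(T) gives m >= log(PM(T)) / log(pml), so L* = log(PM(T)) / log(pml) * 1/T. The function name `optimal_load` is hypothetical.

```python
import math


def optimal_load(pm_t: float, p_ml: float, t: float) -> float:
    """Assumed worst-case per-member load L* = log(PM(T)) / log(p_ml) / T."""
    return math.log(pm_t) / math.log(p_ml) / t


# Illustrative values: mistake probability 1e-4, 10% message loss, T = 10 time units.
print(optimal_load(pm_t=1e-4, p_ml=0.1, t=10.0))   # about 0.4 messages per time unit
```

The group size N does not appear anywhere in the expression, which matches the note that the optimal load is independent of N.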


  • The problem is that gossip-based membership tries to do both failure detection and dissemination together.
  • So, the key is to
    • separate the two components
    • use a non-heartbeat-based failure detection component

Another Probabilistic Failure Detector


Dissemination and suspicion
